Abstract


A time-marching Navier-Stokes code, successfully used for over a decade for projectile
aerodynamics, was chosen as a test case and optimized to run on modern reduced instruction set
computer (RISC)-based parallel computers. The parallelized version of the code has been used
to compute the axisymmetric and three-dimensional (3-D) turbulent flow over a number of
projectile configurations at transonic and supersonic speeds. In most of these cases, the results
were then compared to those obtained with the original version of the code on a Cray C-90. Both
versions of the code produced the same qualitative and quantitative results. Considerable
performance gain was achieved by the optimization of the serial code on a single processor.
Parallelization of the optimized serial code, which uses loop-level parallelism, led to additional
gains in performance. The original algorithm remained unchanged. Recent runs on a
128-processor Origin 2000 have produced speedups in the range of 10-26 over that achieved
when using a single processor on a Cray C-90. Computed surface pressures were compared with
the experimental data and were generally found to be in good agreement with the data.
1. Introduction


Advancements in computer technology and state-of-the-art numerical procedures enable one
to find solutions to complex time-dependent problems associated with projectile aerodynamics,
store separation from fighter planes, and other multibody systems. Application of computational
fluid dynamics (CFD) to multibody configurations has proven to be a valuable tool in evaluating
potential new designs. Although the computational results obtained are encouraging and
valuable, the computer central processing unit (CPU) time required for each time-dependent
calculation is immense, even for axisymmetric flows, with three-dimensional (3-D) calculations
being worse. This problem becomes even more extreme when one looks at the turnaround time.
These times must be reduced by at least an order of magnitude before this technology can be used
routinely for the design of multibody projectile systems. This is also true for numerical
simulation of single projectile-missile configurations, which are, at times, quite complex and
require large computing resources. The primary technical challenge is to effectively utilize new
advances in computer technology in order to significantly reduce run time and to achieve the
desired improvements in the turnaround time.

The  U.S.  Department  of  Defense  (DOD)  is  actively  upgrading  its  high-performance 
computing (HPC) resources through the DOD High-Performance Computing Modernization
Program (HPCMP). The goal of this program is to provide the research scientists and engineers
with the best computational resources (networking, mass storage, and scientific visualization) for
improved  design  of  weapon  systems.  The  program  is  designed  to  procure  state-of-the-art 
computer  systems  and  support  environments.  One  of  the  initiatives  of  the  DOD  HPCMP  is  the 
Common  High-Performance  Computing  Software  Support  Initiative  aimed  at  developing 
application  software  for  use  with  systems  being  installed.  This  program  covers  10  computational 
technology  areas  (CTAs)  that  have  been  deemed  crucial  in  the  DOD  science  and  engineering 
community.  One  of  the  CTAs  is  CFD.  A  major  portion  of  this  effort  has  to  do  with  developing 
software  to  run  on  the  new  scalable  systems,  since  much  of  the  existing  code  was  developed  for 
vector  systems.  One  of  the  codes  that  was  selected  for  this  effort  is  the  F3D  [1,2]  code,  which 
was  originally  developed  at  NASA  Ames  Research  Center  with  subsequent  modifications  made 
at the U.S. Army Research Laboratory (ARL). This code is a Navier-Stokes solver capable of
performing implicit and explicit calculations. It has been extensively validated and calibrated for
many applications in the area of projectile aerodynamics for over a decade. As such, there was a
strong interest in porting this code to the new environments. A key reason for choosing this flow
solver is its proven ability to compute the flow field for projectile configurations using
Navier-Stokes computational techniques [3-6]. The same flow solver has also been used to
compute 3-D flow over various spinning and nonspinning projectile configurations. Computed
results (including axial force, normal force, pitching moment, and Magnus force and moment)
obtained with this code compared favorably with experimental and flight test data.

The  key  breakthrough  was  the  realization  that  many  of  the  new  systems  seemed  to  lend 
themselves  to  the  use  of  loop-level  parallelism.  This  strategy  offered  the  promise  of  allowing  the 
code  to  be  parallelized  with  absolutely  no  changes  to  the  algorithm.  This  paper  describes  the 
solution technique, parallelization of the code, and its application to Army projectile
configurations. 

2.  Solution  Technique 

2.1 Governing Equations. The complete set of 3-D, time-dependent, generalized geometry,
Reynolds-averaged, thin-layer Navier-Stokes equations is solved numerically to obtain a solution
to this problem and can be written in general spatial coordinates $\xi$, $\eta$, and $\zeta$ as follows [1]:

$$\partial_\tau \hat{q} + \partial_\xi \hat{F} + \partial_\eta \hat{G} + \partial_\zeta \hat{H} = Re^{-1}\,\partial_\zeta \hat{S}. \qquad (1)$$

In equation (1), $\hat{q}$ contains the dependent variables: density, three velocity components, and
energy. The thin-layer approximation is used here, and the viscous terms involving velocity
gradients in both the longitudinal and circumferential directions are neglected. The viscous
terms are retained in the normal direction, $\zeta$, and are collected into the vector $\hat{S}$. A similar
thin-layer approximation is also used in the other directions when needed.
2.2 Numerical Technique. The implicit, approximately factored scheme for the thin-layer
Navier-Stokes equations using central differencing in the $\eta$ and $\zeta$ directions and upwinding in $\xi$
is written in the following form [1]:

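$$
\begin{aligned}
&\left[ I + i_b h \delta_\xi^b (\hat{A}^+)^n + i_b h \delta_\zeta \hat{C}^n - i_b h Re^{-1} \bar{\delta}_\zeta J^{-1} \hat{M}^n J - i_b D_i|_\zeta \right] \\
&\quad \times \left[ I + i_b h \delta_\xi^f (\hat{A}^-)^n + i_b h \delta_\eta \hat{B}^n - i_b D_i|_\eta \right] \Delta\hat{Q}^n \\
&= -i_b \Delta t \left\{ \delta_\xi^b \left[ (\hat{F}^+)^n - \hat{F}^+_\infty \right] + \delta_\xi^f \left[ (\hat{F}^-)^n - \hat{F}^-_\infty \right] + \delta_\eta \left( \hat{G}^n - \hat{G}_\infty \right) \right. \\
&\qquad \left. + \delta_\zeta \left( \hat{H}^n - \hat{H}_\infty \right) - Re^{-1} \bar{\delta}_\zeta \left( \hat{S}^n - \hat{S}_\infty \right) \right\} - i_b D_e \left( \hat{Q}^n - \hat{Q}_\infty \right), \qquad (2)
\end{aligned}
$$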

where $h = \Delta t$ or $(\Delta t)/2$ and the free-stream base solution is used. Here, $\delta$ is typically a three-point
second-order-accurate central difference operator, $\bar{\delta}$ is a midpoint operator used with the viscous
terms, and the operators $\delta_\xi^b$ and $\delta_\xi^f$ are backward and forward three-point difference operators.

The  flux  F  has  been  eigensplit,  and  the  matrices  A,  B,  C,  and  M  result  from  local 
linearization  of  the  fluxes  about  the  previous  time  level.  Here,  J  denotes  the  Jacobian  of  the 
coordinate transformation. Dissipation operators $D_e$ and $D_i$ are used in the central space
differencing  directions. 
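
The difference operators named above can be illustrated with a minimal Python sketch. It assumes a uniform one-dimensional grid and the standard second-order three-point stencils; the function names and the specific one-sided coefficients are assumptions made for illustration, since the text states only that the operators are three-point. In the split scheme, the backward operator acts on the positive part of the flux and the forward operator on the negative part.

```python
import numpy as np

def central_diff(u, h):
    """Three-point, second-order central difference at interior points."""
    return (u[2:] - u[:-2]) / (2.0 * h)

def backward_diff(u, h):
    """Three-point, second-order backward difference at points i >= 2."""
    return (3.0 * u[2:] - 4.0 * u[1:-1] + u[:-2]) / (2.0 * h)

def forward_diff(u, h):
    """Three-point, second-order forward difference at points i <= n - 3."""
    return (-3.0 * u[:-2] + 4.0 * u[1:-1] - u[2:]) / (2.0 * h)

# Check on a quadratic, for which all three stencils are exact up to round-off.
x = np.linspace(0.0, 1.0, 11)
h = x[1] - x[0]
u = x**2
exact = 2.0 * x
```

Comparing `central_diff(u, h)` with `exact[1:-1]`, `backward_diff(u, h)` with `exact[2:]`, and `forward_diff(u, h)` with `exact[:-2]` shows which grid points each stencil covers.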

2.3 Chimera Composite Grid Technique. The Chimera [7-9] overset grid scheme is a
domain decomposition approach where a full configuration is meshed using a collection of
independent overset grids. This allows each component of the configuration to be gridded
separately and overset into a main grid. Overset grids are not required to join in any special way.
Usually, a major grid covers the entire domain or a grid is generated about a dominant body
section. Minor grids are generated about the rest of the bodies or sections. Because each
component  grid  is  generated  independently,  portions  of  one  grid  may  be  found  to  lie  within  a 
solid  boundary  contained  within  another  grid.  Such  points  lie  outside  the  computational  domain 
and are excluded from the solution process. Equation (2) has been modified for Chimera overset
grids by the introduction of the flag $i_b$ to achieve just that. This $i_b$ array accommodates the
possibility of having arbitrary holes in the grid. The $i_b$ array is defined such that $i_b = 1$ at normal
grid points and $i_b = 0$ at hole points. Thus, when $i_b = 1$, equation (2) becomes the standard
scheme. But, when $i_b = 0$, the algorithm reduces to $\Delta\hat{Q}^n = 0$ or $\hat{Q}^{n+1} = \hat{Q}^n$, leaving $\hat{Q}$
unchanged at hole points. The grid points that form the border between the hole points and
the normal field points are called intergrid boundary points. These points are updated by
interpolating the solution from the overset grid that created the hole. Values of the $i_b$ array and
the interpolation coefficients needed for this update are provided by a separate algorithm [7].
The  Chimera  procedure  reduces  a  complex  problem  into  a  number  of  simple  subproblems. 
Computations  are  performed  on  each  grid  separately.  A  major  part  of  the  Chimera  overset  grid 
approach  is  the  information  transfer  from  one  grid  into  another  by  means  of  the  intergrid 
boundary  points. 
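
The effect of the hole flag can be seen in a small one-dimensional stand-in for equation (2). The sketch below is illustrative only: the implicit operator `L`, the right-hand side `rhs`, and the function name are hypothetical, and only the placement of the flag (multiplying every term except the identity on the left and the whole right-hand side) follows the text.

```python
import numpy as np

def blanked_delta_q(L, rhs, ib):
    """Solve [I + ib*L] dq = ib*rhs, a 1-D stand-in for equation (2)."""
    ib = np.asarray(ib, dtype=float)
    M = np.eye(len(ib)) + ib[:, None] * L   # hole rows collapse to the identity
    return np.linalg.solve(M, ib * rhs)

n = 6
# Hypothetical implicit operator: h times a central difference, with h = 0.1.
L = 0.1 * (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1))
rhs = np.ones(n)                    # stand-in for the right-hand side of (2)
ib = np.array([1, 1, 0, 0, 1, 1])   # two hole points in the middle of the grid
dq = blanked_delta_q(L, rhs, ib)    # dq is zero at the two hole points
```

At the hole points the row of the system reduces to `dq = 0`, so the solution there is left unchanged, while the remaining points are solved as usual.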

2.4 Boundary Conditions. For simplicity, most of the boundary conditions have been
imposed explicitly. An adiabatic wall boundary condition is used on the body surface, and the
no-slip boundary condition is used at the wall. The pressure at the wall is calculated by solving a
combined momentum equation. Free-stream boundary conditions are used at the inflow
boundary as well as at the outer boundary. A symmetry boundary condition is imposed at the
circumferential edges of the grid, while a simple extrapolation is used at the downstream
boundary. A combination of symmetry and extrapolation boundary conditions is used at the
center line (axis). For supersonic flows, a nonreflection boundary condition is used at the outer
boundary. For overset grids, the outer boundary of the component grids lies completely within
the background projectile grid and, thus, gets its flow-field information interpolated from the
projectile grid.

3.  Parallelization  Methodology 

Many modern parallel computers are now based on high-performance reduced instruction set
computer (RISC) processors. There are two important conclusions that one can reach from this
observation: (1) in theory, there are many cases in which it will no longer be necessary to use
over 100 processors in order to meet the user's needs and (2) if the theory is to be met, one must
achieve a reasonable percentage of the peak processing speed of the processors being used.
Additionally, the first conclusion allows for the use of alternative architectures and
parallelization techniques that might support only a limited degree of parallelism (e.g., 10-100
processors). Based on this reevaluation, some important conclusions were reached.

(1)  In  using  traditional  parallel  algorithms  and  techniques,  using  significantly  fewer 
processors  can 

(a) decrease the system cost,

(b)  increase  the  reliability  of  the  system, 

(c)  decrease  the  extent  to  which  the  efficiency  of  the  algorithm  is  degraded, 

(d)  decrease  the  percentage  of  the  run  time  spent  passing  messages,  and 

(e)  decrease  the  effect  of  Amdahl’s  Law. 

(2) Possibly of even greater significance was the observation that, with loop-level
parallelism, it is possible to avoid many of the problems associated with parallel
programming altogether. This is not a new observation, but only now is it starting to be
a useful one. The key things that changed are that

(a) loop-level parallelism is frequently restricted to using modest numbers of
processors and the processors therefore have to be fast enough to achieve an
acceptable level of performance;

(b) loop-level parallelism will, in general, try to use the same sources of parallelism
used to produce a vectorizable code (this makes it difficult to efficiently use this
type of parallelism on a machine equipped with vector processors); and

(c) it is difficult to make efficient use of loop-level parallelism on anything but a
shared-memory architecture, and, only recently, have vendors started to ship
shared-memory architectures based on RISC processors with aggregate peak speeds
in excess of a few giga-floating-point operations per second (GFLOPS).
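
Item (1)(e) above can be made concrete with the textbook form of Amdahl's Law. The sketch below is not taken from the paper: the 99% parallel fraction and the processor counts are hypothetical values chosen only to show how parallel efficiency falls as processors are added.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Ideal speedup when only part of the run time can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)

def efficiency(parallel_fraction, n_procs):
    """Speedup divided by the number of processors used."""
    return amdahl_speedup(parallel_fraction, n_procs) / n_procs

# Hypothetical code that is 99% parallel, run on 16 and on 128 processors.
few = amdahl_speedup(0.99, 16)
many = amdahl_speedup(0.99, 128)
```

Under these assumed numbers, the 16-processor run retains a much larger share of its ideal speedup than the 128-processor run, which is the sense in which fewer, faster processors reduce the effect of Amdahl's Law.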

By  combining  aggressive  serial  optimizations  with  loop-level  parallelization  of  vectorizable 
loops  (some  loop  interchanging  was  also  required),  all  of  the  design  goals  were  met. 
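
The structure of loop-level parallelism can be sketched in Python as follows. This is an illustration of the idea only, not the mechanism used for F3D: the recurrence, array shape, and function names are hypothetical, and Python threads are used purely to show that the loop body and arithmetic stay the same while the independent outer-loop iterations are handed to different workers. No speedup should be expected from this toy.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def sweep_line(q, k):
    """Serial recurrence along j for one k line; different lines are independent."""
    for j in range(1, q.shape[0]):
        q[j, k] += 0.5 * q[j - 1, k]

def sweep_serial(q):
    """Original form: a plain outer loop over k."""
    for k in range(q.shape[1]):
        sweep_line(q, k)
    return q

def sweep_parallel(q, n_workers=4):
    """Same loop body; only the outer loop is distributed across workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(lambda k: sweep_line(q, k), range(q.shape[1])))
    return q

a = np.arange(20, dtype=float).reshape(5, 4)
serial_result = sweep_serial(a.copy())
parallel_result = sweep_parallel(a.copy())
```

Because each line performs the same operations in the same order in both versions, the two results agree exactly, which mirrors the requirement that the algorithm itself remain unchanged.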

4.  Results 

4.1  Supersonic  Flow  Over  a  Missile  Body.  A  generic  missile  configuration  was  used  for 
many  of  the  tests  on  the  parallelized  code.  In  these  tests,  a  one-million-point  grid  (see  Figure  1) 
was used to check the accuracy of the results. The computed results obtained with the
parallelized  code  were  compared  with  those  obtained  using  the  vectorized  code  on  a  Cray  C-90. 
These  computed  results  were  compared  with  the  experimental  data  obtained  at  the  Defense 
Research  Agency  (DRA)  [10],  UK,  for  the  same  configuration  and  test  conditions.  Typically, 
computation  on  the  C-90  used  18  MW  (148  MB)  of  memory  and  7.5  hr  of  CPU  time.  Once  the 
accuracy  of  the  computed  result  was  verified,  performance  studies  were  carried  out  for  grid  sizes 
ranging  from  1  to  53  million  grid  points.  Figure  2  shows  the  computed  pressure  contours  for 
Mach number, M = 2.5, and angle of attack, $\alpha$ = 14°, for the 1-million-grid-point case. It shows
the computed pressure contours for both windside (bottom) and leeside (top). Computed
pressures  were  obtained  at  1,800  time  steps  using  both  the  Power  Challenge  Array  (PCA)  and  the 
C-90.  Both  solutions  produce  identical  results  and  show  the  expected  shock  wave  flow  features. 
Figure  3  shows  the  circumferential  surface  pressure  distributions  of  the  missile  at  a  selected 
longitudinal  station.  Computed  results  from  both  versions  of  the  code  are  shown  to  lie  on  top  of 
one  another  and,  thus,  are  in  excellent  agreement.  Both  computed  and  experimental  results  show 
the same trends (i.e., higher surface pressure on the windside and low pressure on the leeside).

Figure 3. Surface Pressure Comparison.

These results were obtained using a highly efficient serial algorithm as the starting point,
taking great care not to make any changes to the algorithm. Initial efforts to run the
vector-optimized version of this code on one processor of a Silicon Graphics (SGI) Power Challenge
(75-MHz  R8000  processor)  proved  to  be  extremely  disappointing.  After  aggressively  tuning  the 
code  for  a  low-cache  miss  rate  and  good  pipeline  efficiency,  a  factor  of  10  improvement  in  the 
serial  performance  of  this  code  was  achieved.  At  this  point,  the  percentage  of  peak  performance 
from the RISC-tuned code using one processor on the SGI Power Challenge was the same as the
vector-tuned  code  on  one  processor  of  a  Cray  C-90.  A  key  enabling  factor  was  the  observation 
that  processors  with  a  large  external  cache  (e.g.,  1-4  MB  in  size)  could  enable  the  use  of 
optimization  strategies  that  simply  were  not  possible  on  machines  like  the  Cray  T3D  and  Intel 
Paragon,  which  only  have  16  KB  of  cache  per  processor.  This  relates  to  the  ability  to  size 
scratch arrays so that they will fit entirely in the large external cache. This can reduce the rate of
cache  misses  associated  with  these  arrays,  which  miss  all  the  way  back  to  main  memory,  to  less 
than  0.1%  (the  comparable  cache  miss  rates  for  machines  like  the  Cray  T3D  and  Intel  Paragon 
could  easily  be  as  high  as  25%). 
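
The cache-sizing argument can be checked with simple arithmetic. The sketch below assumes 8-byte (64-bit) words and a hypothetical set of 10 scratch arrays that must all be resident at once; neither figure comes from the paper, and only the cache sizes (1 MB, 4 MB, and 16 KB) are taken from the text.

```python
def max_scratch_length(cache_bytes, n_arrays, bytes_per_word=8):
    """Largest per-array length (in words) for which all arrays fit in cache."""
    return cache_bytes // (n_arrays * bytes_per_word)

KB = 1024
MB = 1024 * KB
N_ARRAYS = 10  # hypothetical number of scratch arrays, for illustration only

small_cache = max_scratch_length(16 * KB, N_ARRAYS)   # T3D/Paragon-class cache
large_cache_1mb = max_scratch_length(1 * MB, N_ARRAYS)
large_cache_4mb = max_scratch_length(4 * MB, N_ARRAYS)
```

Under these assumptions, the large external caches hold scratch arrays tens of times longer than a 16-KB cache can, which is what makes it practical to size the arrays to a full grid line or plane rather than to a short strip.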

While  the  effort  to  tune  the  code  was  nontrivial,  the  initial  effort  to  parallelize  the  code  was 
already  showing  good  speedup  on  12  processors  (the  maximum  number  of  processors  available 
in  one  Power  Challenge  at  that  time).  Additional  efforts  extended  this  work  to  larger  numbers  of 
processors  on  a  variety  of  systems.  Most  recently,  work  has  been  performed  on  64-  and 
128-processor SGI Origin 2000s (the latter is an experimental system located at the Naval
Research Laboratory in Washington, DC). This work has extended the range of problem sizes
up to 53 million grid points spread between just three zones and up to 115 processors on the
larger machine (due to the stair-stepping effect, the problem sizes run on this machine were not
expected to get any additional benefit from using 116-128 processors).
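
One common reading of the stair-stepping effect is that, when a parallel loop of N iterations is divided among P processors, the run time follows the share of the busiest processor, which is the ceiling of N/P. The sketch below illustrates that reading; it is an interpretation rather than a definition given in the paper, and the loop length of 345 is a hypothetical value chosen only so that a step edge falls at 115 processors.

```python
def busiest_share(n_iterations, n_procs):
    """Iterations assigned to the most heavily loaded processor."""
    return -(-n_iterations // n_procs)  # integer ceiling division

N_ITER = 345  # hypothetical parallel loop length, not taken from the paper
shares = {p: busiest_share(N_ITER, p) for p in (64, 114, 115, 116, 128)}
```

With this loop length, the busiest processor's share drops when moving from 114 to 115 processors but is unchanged from 115 through 128, so the extra processors would sit idle.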

Figure  4  shows  the  performance  results  for  a  data  set.  All  results  have  been  adjusted  to 
remove start-up and termination costs. The latest results show a factor of 900-1,000 speedup
from  the  original  runs  made  using  one  processor  of  the  Power  Challenge  with  the 
vector-optimized  code  (the  corresponding  increase  in  processing  power  was  less  than  a  factor  of 
160).  Additionally,  speedups  as  high  as  26.1  relative  to  one  processor  of  a  C-90  have  been 
achieved. Since the numerical algorithm was unchanged, this represents a factor of 26.1 increase
in  the  speed  at  which  floating-point  operations  were  performed  and,  consequently,  the  wall  clock 
time  required  for  a  converged  solution  decreased  by  the  same  factor.  In  a  production 
environment,  such  as  is  found  at  the  four  Major  Shared  Resource  Centers  (MSRCs)  set  up  by  the 
DOD  HPCMP,  these  results  represent  an  opportunity  to  significantly  improve  the  job  throughput. 
These  results  clearly  demonstrate  that,  when  using  the  kinds  of  techniques  described  herein,  it  is 
possible  to  achieve  high  levels  of  performance  with  good  scalability  on  at  least  some 
RISC-based,  shared-memory,  symmetric  multiprocessors.  It  is  also  interesting  to  note  that  these 
results  were  obtained  without  the  use  of  any  assembly  code  or  system-specific  libraries  and  with 
relatively  little  help  from  the  vendors. 

4.2 Transonic Flow Over a Secant-Ogive Cylinder-Boattail (SOCBT) Projectile. The
projectile modeled in this example consists of a three-caliber secant-ogive nose, a two-caliber
cylinder, and a one-caliber 7° boattail. For this case, the base is not included and the boattail is
extended  as  a  sting.  A  schematic  diagram  of  the  projectile  is  shown  in  Figure  5.  Computed 
surface  pressure  is  compared  to  experimental  surface  pressure  measurements  made  by  Kayser 
and  Whiton  [11].  The  computational  grid  used  for  this  calculation  was  obtained  using  a 
hyperbolic  grid  generator.  The  grid  consists  of  128  longitudinal  points  and  56  radial  points. 
There are three planes in the circumferential direction. The computational domain extends to
about 3.5 body lengths in front of the projectile, in the radial direction, and behind the projectile. An
expanded view of the grid near the projectile surface is shown in Figure 6. The grid points are
clustered in the longitudinal direction at the ogive-cylinder and cylinder-boattail junctions. In
the normal (radial) direction, the grid points are clustered near the body surface with a minimum
spacing of 0.00002 and are stretched to the outer boundary. Figure 7 shows pressure contours
for the converged solution. It shows the expansion of the flow at the ogive-cylinder and
cylinder-boattail corners, as well as the location of the shock wave. Figure 8 shows a
comparison of computed surface pressure with experimental data. The expansions at the
projectile corners are clearly seen in the computation and are in good agreement with the
experimental data.

Figure 5. Schematic Diagram of SOCBT Projectile.

Figure 6. Computational Grid Near the Projectile.


Figure 8. Comparison of Surface Pressure.

Figure 9. Projectile Configuration.

4.3 Flow Over Segments. This multibody problem involves the separation of two
submunitions at a low transonic speed. Figure 9 shows the components of the projectile
configuration. The interest here is the aerodynamic interference effect of the two submunitions
in flight. Each submunition is a right circular cylinder of length-to-diameter ratio of 1.38. The
inflow, far-field, and outflow boundaries are placed far enough for computation of transonic
flows. The complete grid consists of three zones, with approximately 83,000 grid points. Each
grid section was obtained separately and then appended to provide the full grid. The Cartesian
grid, which forms the background domain, was obtained algebraically and consists of 200 x 3 x
90 points in the axial, circumferential, and normal directions. A body-conforming grid (164 x
3 x 30) was obtained for each submunition using a hyperbolic grid generator. These
submunition  grids  are  then  overset  on  the  background  Cartesian  grid  to  form  the  composite 
mesh.  An  expanded  view  of  the  composite  overset  mesh  system  is  shown  in  Figure  10  for  the 
multibody  separation  problem.  The  first  (leading)  submunition  grid  is  a  minor  grid,  as  is  the 
second  (trailing)  submunition  grid.  The  minor  grids  are  completely  overlapped  by  the  major 
grid;  thus,  their  outer  boundaries  can  obtain  information  by  interpolation  from  the  major  grid. 
Similar  data  transfer  or  communication  is  needed  from  the  minor  grids  to  the  major  grid. 
However,  a  natural  outer  boundary  that  overlaps  the  two  submunition  grids  does  not  exist.  The 
Chimera  technique  creates  an  artificial  boundary  (also  known  as  a  hole  boundary)  between  grids 
that  provides  the  required  path  for  information  transfer  from  the  minor  submunition  grids  to  the 
background  grid.  The  resulting  hole  region  is  excluded  from  the  flow-field  solution  in  the 
background  grid.  This  case  (see  Figure  10)  corresponds  to  a  separation  distance  of  one  caliber 
between  the  two  submunitions. 
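
The quoted total of approximately 83,000 grid points can be checked directly from the zone dimensions given above. The short sketch below uses only those dimensions; the function and variable names are introduced here for illustration.

```python
from math import prod

def zone_points(dims):
    """Number of grid points in a structured zone with the given dimensions."""
    return prod(dims)

background = zone_points((200, 3, 90))    # Cartesian background grid
submunition = zone_points((164, 3, 30))   # one body-conforming submunition grid
total = background + 2 * submunition      # background plus two submunition grids
```

The background grid contributes 54,000 points and each submunition grid 14,760, for a total of 83,520, consistent with the figure quoted in the text.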

Figure  11  shows  the  components  of  another  multibody  projectile  configuration,  which 
included  a  design  modification  for  the  second  submunition.  A  thin  fin  is  added  at  the  back  of  the 
second  submunition  to  provide  more  drag  during  the  separation  process.  The  same  Cartesian 
background  is  used  in  this  case.  The  second  submunition  grid  was  obtained  separately  using  a 
hyperbolic grid generator. It consists of 228 x 3 x 30 points. The major grid or the background
grid  is  easily  generated  independently  of  the  minor  grid  (the  grid  for  the  submunitions).  The 
composite  overset  mesh  system  for  this  case  is  shown  in  Figure  12.  For  moving-body  problems, 
both minor grids (shown in Figure 12) can move with the submunitions as they separate from
each other. Again, there is no need to generate new grids for the submunitions during the
dynamic process. An advantage of the Chimera technique is that it allows computational grids to
be obtained for each body component separately and, thus, makes the grid generation process
easier. Grid points are clustered near the submunition surfaces to capture the viscous boundary
layers.

Figure 10. Grids for Two Submunitions.

Figure 11. Modified Projectile Configuration.

Numerical computations were performed for these configurations at Mach number, $M_\infty$ =
0.80, and angle of attack, $\alpha$ = 0°. Results are presented for both the original and the modified
designs.  Figure  13  shows  the  Mach  number  contours  for  the  submunitions  for  the  original 
configuration.  The  flow  field  is  unsteady,  and  the  second  submunition  is  completely  submerged 
in  the  wake  of  the  first  submunition.  The  pressure  behind  the  first  submunition  is  lower  than  the 
pressure  ahead  of  it  and,  therefore,  as  expected,  results  in  positive  drag.  The  pressure  behind  the 
second  submunition  is,  however,  higher  than  the  pressure  ahead  of  it  and,  therefore,  results  in 
negative  drag.  Since  the  drag  for  the  first  submunition  is  positive,  it  tends  to  slow  its  motion. 
The  drag  for  the  second  submunition  is  negative,  which  results  in  it  being  pulled  back  toward  the 
first  submunition.  This  can  lead  to  undesirable  submunition  collisions.  To  avoid  the 
submunition  collision,  fins  were  added  to  the  second  submunition  to  provide  added  drag.  The 
same  Chimera  composite  overset  grid  approach  was  used  to  numerically  model  this  modified 
configuration.  Figure  14  shows  the  Mach  number  contours  for  the  submunitions  for  the 
modified  multibody  design.  As  seen  in  this  figure,  the  second  submunition  is,  again,  completely 
submerged  in  the  wake  of  the  first  submunition.  It  also  indicates  that  the  fin  affects  the  flow 
field  for  the  first  submunition.  The  drag  for  the  second  submunition  for  the  modified  design  case 
is  larger  than  that  obtained  with  the  original  design.  As  separation  distance  is  increased  between 
the  submunitions,  the  drag  for  the  second  submunition  should  similarly  go  up.  This  increase  in 
drag  for  the  finned  configuration  allows  the  submunitions  to  continually  separate  and  not  come 
back  and  collide. 

Figure 13. Mach Contours (Original Design).

Figure 14. Mach Contours (Modified Design).

4.4 Flow Over a Projectile-Sabot System. Another multibody problem involves the
separation of sabots from a projectile (see Figure 15). The aerodynamic interference of the
projectile and the sabot flow field is quite complex and involves 3-D shock-boundary layer
interactions and separated flow regions. Again, with the overset grid approach, it is possible to
use  different  grid  topologies  for  the  projectile  and  the  sabot  components,  respectively.  Figure  16 
shows  three  overset  meshes  for  a  sabot  angle  of  attack  of  5°,  10°,  and  15°.  The  same  sabot  grid 
used  for  the  5°  angle-of-attack  case  is  used  for  other  sabot  angles  of  attack  without  the  need  to 
regenerate  new  sabot  grids.  Figure  16  shows  computational  grids  for  the  complete  model, 
including the projectile and sabot. The projectile grid consists of three zones, with a large number
of grid points clustered near the sabot region. The first zone is a C-grid, while zones two and
three are rectangular grids, for a total of approximately 800,000 grid points. The grids around
the sabot also consist of three zones and were obtained using O-topology and rectangular
topology. The main sabot grid consists of 94 x 109 x 39 points in the axial, circumferential, and
normal directions. The sabot component also required two other grids (a front grid and an aft
grid). Both these grids are rectangular grids. The sabot grids were individually generated and
then overset to form the complete grid system. In addition, there is a cover grid over the entire
system. The computational grids shown here correspond to the pitch plane. The projectile grid
serves as the main background grid for the computation. Steady-state numerical calculations
have been performed for the projectile-sabot system at $M_\infty$ = 4.0 and $\alpha$ = 0°. Computational
modeling is restricted to the symmetric sabot discard to save computer resources and time. The
projectile is at zero angle of attack, and three sabots are discarded symmetrically following the
same radial trajectory away from the projectile.

Computational studies have been completed for sabot angles of attack of 5°, 10°, and 15°.
The  projectile  is  at  zero  angle  of  attack  for  these  three  cases.  As  stated  earlier,  the  background 
grid  for  the  projectile  remains  the  same.  The  sabot  grids  are,  again,  the  same  but  have  been 
moved  to  the  new  positions  and  orientations.  Figure  17  shows  the  pressure  contours  for  the 
projectile  and  sabot  in  the  symmetry  plane,  for  all  three  sabot  orientations.  This  figure  shows  the 
interactions  of  the  projectile  and  the  sabot  flow  fields  occurring  at  different  longitudinal 
locations along the projectile. The computed pressure contours show the sabot shock impinging
on  the  projectile,  reflecting  from  the  projectile  surface.  The  shock  impingement  results  in  a 
higher  pressure  region  on  the  projectile  surface  just  downstream  of  the  impingement  point.  As 
expected,  the  flow  behind  the  base  region  of  the  sabot  is  a  low-pressure  region.  As  the  sabot 
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Figure  16.  Gi 


angle  of  attack  is  increased,  the  sabot  shock  impingement  point  on  the  projectile  is  moved  further 
downstream.  For  die  5°  sabot  angle  of  attack,  the  sabot  shock  impinges  on  the  projectile,  reflects 
from  the  projectile  surface,  and  impinges  back  on  tihe  sabot  The  reflected  shock  from  the 
projectile  surface  is  seen  to  just  miss  the  base  of  the  sabot  for  the  10°  sabot  angle-of-attack  case 
and  is  even  further  away  from  the  sabot  base  for  the  15°  sabot  angle  of  attack  case.  The  flow 
field  in  the  base  region  of  the  sabot  is  also  seen  to  change  considerably  with  an  increase  in  sabot 
angle  of  attack. 
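
Reusing the same sabot grid at a new position and orientation amounts to a rigid-body move of its grid points. The following is a minimal sketch of that idea, not the procedure used in the flow solver: it assumes a 2-D pitch-plane grid stored as (x, y) pairs, and the pivot point, offset, and rotation sign convention are hypothetical choices made only for illustration.

```python
import math

def reposition_grid(points, alpha_deg, pivot=(0.0, 0.0), offset=(0.0, 0.0)):
    """Rotate (x, y) grid points by alpha_deg about pivot, then translate by offset.

    All arguments are illustrative assumptions. A rigid-body move leaves the
    grid spacing unchanged, so the same sabot grid can be reused for each case.
    """
    a = math.radians(alpha_deg)
    c, s = math.cos(a), math.sin(a)
    px, py = pivot
    moved = []
    for x, y in points:
        dx, dy = x - px, y - py
        # Counterclockwise rotation about the pivot (sign convention assumed).
        xr = px + c * dx - s * dy
        yr = py + s * dx + c * dy
        moved.append((xr + offset[0], yr + offset[1]))
    return moved

# Hypothetical three-point segment of a sabot grid line (arbitrary units).
base_grid = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
for alpha in (5.0, 10.0, 15.0):
    grid = reposition_grid(base_grid, alpha, offset=(0.0, 1.0))
    print(alpha, [(round(x, 3), round(y, 3)) for x, y in grid])
```

Because only the sabot grid points move, the background projectile grid is untouched; in an overset approach the intergrid connectivity would then have to be recomputed for each new sabot position, which this sketch does not attempt.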

Figures  18  and  19  show  computed  surface  pressures  for  the  sabot  and  the  projectile, 
respectively.  These  computed  surface  pressures  correspond  to  the  pitch  plane  and  are  compared 
with the experimental data [12]. The computed pressures on the bottom surface of the sabot are
shown in Figure 18 and are generally found to be in agreement with the experimental data.
Some discrepancies do exist in the comparison of sabot surface pressure for the 5°
angle-of-attack case. Due to the close proximity of the sabot to the projectile, the flow field is more
complicated and includes complex shock-shock and shock-boundary layer interactions.
Accurate computation of the resulting flow field is thus more difficult. Here, X/D = 0
corresponds to the nose of the projectile. Figure 19 shows the surface pressure distributions on
the projectile in the pitch plane for the 5°, 10°, and 15° sabot angle-of-attack cases. Computed
results are shown as a solid line and are compared with the experimental data shown as dark
circles. As seen in this figure, the surface pressure is almost constant on the nose, which is
followed by a pressure drop at the cone-cylinder junction. This computed pressure drop at the
cone-cylinder junction agrees well with the data at the 10° and 15° sabot angles of attack;
however, the agreement is not as good for the 5° case. The predicted flow on the nose of the
projectile corresponds to an undisturbed flow upstream of the shock impingement point. Clearly,
the numerical results do not show the same extent of shock-boundary layer interactions observed
experimentally. A large pressure increase due to the shock wave impinging on the projectile
surface is seen in both computed and experimental data. The locations and magnitudes of the
pressure peaks have been predicted fairly well.
Figure 18. Sabot Surface Pressure Distributions
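
A comparison of computed and measured surface pressures such as the one described above can be made quantitative by interpolating the computed curve at the experimental stations. The sketch below is illustrative only: the arrays are made-up placeholder values, not data from this report or from [12], and the function names are hypothetical.

```python
import math

def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of (xs, ys) at x; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def rms_difference(x_comp, p_comp, x_exp, p_exp):
    """RMS difference between the computed curve and the experimental points."""
    sq = [(interpolate(x_comp, p_comp, x) - p) ** 2 for x, p in zip(x_exp, p_exp)]
    return math.sqrt(sum(sq) / len(sq))

def peak_location(xs, ps):
    """Return (x, p) at the maximum pressure, e.g. near a shock impingement."""
    i = max(range(len(ps)), key=lambda k: ps[k])
    return xs[i], ps[i]

# Placeholder values only (X/D stations and pressure ratios), NOT report data.
x_comp = [0.0, 1.0, 2.0, 3.0, 4.0]
p_comp = [1.0, 1.0, 0.8, 2.0, 1.2]
x_exp = [0.5, 2.5, 3.0]
p_exp = [1.0, 1.5, 1.9]

print(rms_difference(x_comp, p_comp, x_exp, p_exp))
print(peak_location(x_comp, p_comp))
```

Comparing the peak location and magnitude separately from the overall RMS difference mirrors the discussion above, where the pressure peaks agree fairly well even when the agreement elsewhere (for example, the 5° case) is weaker.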


5.  Concluding  Remarks 


A time-marching Navier-Stokes code, successfully used for over a decade for projectile
aerodynamics, was chosen as a test case and optimized to run on modern RISC-based parallel
computers.  The  parallelized  version  of  the  code  has  been  used  to  compute  the  axisymmetric  and 
3-D  turbulent  flow  over  a  number  of  projectile  configurations  at  transonic  and  supersonic  speeds. 
In  most  of  these  cases,  these  results  were  then  compared  to  those  obtained  with  the  original 
version  of  the  code  on  a  Cray  C-90.  Both  versions  of  the  code  produced  the  same  qualitative  and 
quantitative  results.  Considerable  performance  gain  was  achieved  by  the  optimization  of  the 
serial  code  on  a  single  processor.  Parallelization  of  the  optimized  serial  code,  which  uses 
loop-level parallelism, led to additional gains in performance. The original algorithm remained
unchanged. Recent runs on a 128-processor Origin 2000 have produced speedups in the range of
10-26 over that achieved when using a single processor on a Cray C-90. Computed surface
pressures were compared with the experimental data and were generally found to be in good
agreement with the data.